1. Introduction

In Project 1, we examined the AirBnB dataset and conducted basic exploratory data analysis (EDA) with statistical inference to establish baseline relationships between variables. While the t-tests, chi-squared tests, and ANOVA tests yielded interesting results, we wanted to dive deeper into the stories behind the numbers. To do so, we turn to regression to estimate how individual variables affect listing prices, and we introduce a dataset on crime in order to better extrapolate results to the broader population of AirBnBs. The paper first describes the data sources for the two datasets and then provides a high-level EDA of the AirBnB and crime datasets. It proceeds with a variety of regression techniques, beginning with AirBnB variables only to examine which factors affect listing prices, and then overlaying the crime data onto the AirBnB data to investigate the effects of crime on prices. The penultimate section compares the regression results against one another to determine the best model(s). Lastly, the paper concludes with a summary of findings and areas for further research.

2. Data Sources

In this section, we provide a summary of the data sources used in this analysis.

2.1 AirBnB

The AirBnB data for this project is from InsideAirbnb.com. This website contains information sourced by Murray Cox, who used Airbnb’s public application programming interface (API) to collect the data. Cox originally scraped the data to identify illegal listings in New York City. He has since expanded his dataset offerings to cities across the world and makes the data available for open-source use and research.

Although Cox is potentially a biased source because of his activist leanings, his datasets originate from AirBnB itself and are thoroughly documented. We also considered using data from AirBnB directly; however, other studies have shown that this data is outdated and biased in that it shows only the positive side of AirBnB. Cox, by contrast, uses the company’s own data, which he scraped from the website through a well-documented procedure, to explore how Airbnb is really affecting communities. We therefore judged the dataset to be reliable, given the author’s documentation and his purpose for releasing it.

2.2 Crime Data

The crime dataset is sourced from OpenData DC. The dataset contains a subset of locations and attributes of incidents reported in the ASAP (Analytical Services Application) crime report database by the District of Columbia Metropolitan Police Department (MPD).

This data is shared via an automated process in which addresses are geocoded to the District’s Master Address Repository and assigned to the appropriate street block. Block locations for some crime points could not be assigned automatically, resulting in (0,0) for their (x,y) coordinates.
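Records carrying the (0,0) placeholder would distort any location-based join, so they need to be set aside before the crime data is linked to listings. The following is a minimal Python sketch of that filtering step (the output in this paper comes from R; Python is used here only for illustration). The field names `x` and `y` and the three sample records are hypothetical, not the actual OpenData DC column names or values.

```python
# Hypothetical sample of crime records. The field names "x" and "y" are
# assumptions for illustration, not the actual OpenData DC column names.
crimes = [
    {"offense": "ROBBERY", "x": -77.0365, "y": 38.8977},
    {"offense": "BURGLARY", "x": 0.0, "y": 0.0},  # block could not be geocoded
    {"offense": "THEFT/OTHER", "x": -77.0091, "y": 38.8899},
]


def has_valid_location(record):
    """Return True unless the record carries the (0,0) placeholder."""
    return not (record["x"] == 0 and record["y"] == 0)


# Keep only the records that were assigned a real block location.
located = [r for r in crimes if has_valid_location(r)]
dropped = len(crimes) - len(located)
print(f"kept {len(located)} records, dropped {dropped} without a location")
```

Counting the dropped records, rather than silently discarding them, makes it easy to report how much of the crime data could not be placed on the map.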

3. Exploratory Data Analysis

To begin, we present various summary statistics for the two datasets (Airbnb and crime) that we are investigating. Below are the structure printouts for both datasets, beginning with Airbnb, followed by crime:

In the Airbnb (“listings”) dataset, there are 9,126 observations and 17 variables. The Airbnb dataset is relatively comprehensive, consisting of various qualitative variables including unique ID, name, and neighbourhood. Additionally, there are a number of quantitative variables such as price and number of reviews. More importantly, the listings dataset contains latitude and longitude coordinates that will serve as the link to the crime dataset.

On the other hand, the crime dataset contains 29,045 observations and 26 variables. Furthermore, this dataset shows 9 types of offenses as well as the method of crime (gun, knife, others) and time of day (day, evening, midnight). Similarly, the crime dataset is labeled by latitude and longitude coordinates as well as census tract, which will be important variables for joining the two datasets together.
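One possible way to use the shared latitude and longitude coordinates is to count the crimes that fall within a fixed distance of each listing. The Python sketch below illustrates the idea with the haversine formula. It is an assumption about how such a link could be built, not a description of the join actually used in this paper; the 0.5 km radius, the dictionary keys, and the sample points are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def crimes_near(listing, crimes, radius_km=0.5):
    """Count crimes within radius_km of a listing (radius is an assumption)."""
    return sum(
        1
        for c in crimes
        if haversine_km(listing["latitude"], listing["longitude"],
                        c["latitude"], c["longitude"]) <= radius_km
    )


# Hypothetical listing and crime points, for illustration only.
listing = {"latitude": 38.9000, "longitude": -77.0300}
crimes = [
    {"latitude": 38.9010, "longitude": -77.0310},  # roughly 0.14 km away
    {"latitude": 38.9030, "longitude": -77.0300},  # roughly 0.33 km away
    {"latitude": 38.9500, "longitude": -77.0300},  # several km away
]
print(crimes_near(listing, crimes))
```

A radius-based count avoids the boundary effects of zip codes or census tracts, at the cost of comparing every listing with every crime; joining on census tract or zip code is the simpler alternative.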

3.1 Summary Statistics

An important step in EDA is exploring the data through summary statistics. Since we examined the listings dataset thoroughly in Project 1, this section focuses primarily on the crime dataset. Because we are interested in how crime levels affect Airbnb prices, it is important to take a closer look at the different types of crime. Below is a summary table of the number of crimes by offense and ward:

WARD   ARSON   ASSAULT W/DANGEROUS WEAPON   BURGLARY   HOMICIDE   MOTOR VEHICLE THEFT   ROBBERY   SEX ABUSE   THEFT F/AUTO   THEFT/OTHER   TOTAL
   1       1                          151        117         15                   221       345          17           1443          1806    4116
   2       1                          119        168          0                   225       232          26           1898          3467    6136
   3       0                           21         72          3                    89        35          10            566           898    1694
   4       0                           60        110          4                   176       136          13            857           863    2219
   5       1                          241        208         17                   347       312          22           1441          1729    4318
   6       2                          150        119         16                   248       309          23           1625          2411    4903
   7       0                          324        137         38                   421       317          30            756          1191    3214
   8       3                          320        175         54                   249       257          37            521           829    2445

Ward 2 has the most crimes (6,136), followed by Ward 6 with 4,903. Moreover, “Theft/Other” is the most common crime type in every ward, as seen in the table above.
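The ward totals, rankings, and shares can be recomputed directly from the counts in the table. The Python sketch below does so (Python is used for illustration; the paper's own output comes from R). The counts are copied from the table above; only the variable names are ours.

```python
# Offense counts by ward, copied from the summary table above.
OFFENSES = [
    "ARSON", "ASSAULT W/DANGEROUS WEAPON", "BURGLARY", "HOMICIDE",
    "MOTOR VEHICLE THEFT", "ROBBERY", "SEX ABUSE", "THEFT F/AUTO",
    "THEFT/OTHER",
]
counts = {
    1: [1, 151, 117, 15, 221, 345, 17, 1443, 1806],
    2: [1, 119, 168, 0, 225, 232, 26, 1898, 3467],
    3: [0, 21, 72, 3, 89, 35, 10, 566, 898],
    4: [0, 60, 110, 4, 176, 136, 13, 857, 863],
    5: [1, 241, 208, 17, 347, 312, 22, 1441, 1729],
    6: [2, 150, 119, 16, 248, 309, 23, 1625, 2411],
    7: [0, 324, 137, 38, 421, 317, 30, 756, 1191],
    8: [3, 320, 175, 54, 249, 257, 37, 521, 829],
}

# Total crimes per ward and across the District.
totals = {ward: sum(row) for ward, row in counts.items()}
grand_total = sum(totals.values())

# Most common offense in each ward.
top_offense = {
    ward: OFFENSES[row.index(max(row))] for ward, row in counts.items()
}

# Wards ordered from most to fewest crimes, with each ward's share.
ranked = sorted(totals, key=totals.get, reverse=True)
for ward in ranked:
    share = 100 * totals[ward] / grand_total
    print(f"Ward {ward}: {totals[ward]} crimes ({share:.1f}%), "
          f"most common: {top_offense[ward]}")
```

The per-ward shares printed here are the same quantities a pie chart of crimes by ward would display.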

3.2 Data Visualization

Data visualization is a useful tool for better understanding the crime data and its underlying relationships. This section presents two charts: a bar chart and a pie chart. The bar chart is presented below:

Additionally, the same data can be visualized as a pie chart, which shows each ward’s share of the total number of crimes:

These summary statistics and charts are particularly important in the context of Airbnb listings. It is not unreasonable to hypothesize that wards with a higher number of crimes overall may also see an adverse effect on listing prices. If fewer people want to stay in those areas, demand for Airbnbs decreases and, in turn, so do prices. Further analysis using regression techniques is needed to determine the overall effect of crime on prices.

4. Regression Models

After conducting EDA and looking at the variables at a high level, we move on to regression models to estimate the relationships within and between the two datasets. In this section, we examine five models using the techniques learned in class, beginning with a simple linear regression model within the Airbnb dataset alone. We then move to a logistic regression on the same dataset. In the third model, we overlay the crime dataset onto the listings dataset to explore how crime affects Airbnb listing prices. The penultimate model is a hedonic regression used to predict price. Lastly, we apply machine learning methods to break down the primary drivers of price.

4.1 Linear Regression Model

[POLLY’S MODEL]

4.2 Logistic Regression Model

[ELISE’S MODEL]

4.3 Crime and Listing Regressions

One key area of interest is the connection between crime and AirBnB prices. There are two primary hypotheses on how crime relates to listing prices: (1) higher crime rates reduce demand and lower listing prices; and (2) crime targets wealthier neighborhoods, which have higher listing prices.

The conventional thought process is that higher-crime areas will drive potential customers away. As a result, there will be less overall demand for AirBnBs in that neighborhood, leading to a decline in prices. While this mechanism makes sense at face value, thinking more carefully about where crime is most prevalent may point to a different direction of effect. One may argue that more affluent, urban neighborhoods exhibit higher crime rates, especially for home or auto theft. If wealthier and more accessible neighborhoods have higher AirBnB listing prices, then higher crime rates follow from higher prices rather than depress them, in contrast to the conventional view.

To test these hypotheses and to determine which one explains the true relationship, we will first merge the crime dataset with the listing dataset and then construct regression models to estimate the effect of crime on listing prices. Prior to any regression analysis, it is always important to take a cursory glance at the summary statistics and simple data visuals in order to get a sense of how the variables relate to one another. Below is a breakdown of the percentage of crimes by zip code:

As shown above, even within the sample of the ten zip codes with the most crime occurrences, the first zip code (20004; 1,145 total crimes) has more than double the number of crimes of the tenth zip code (20002; 504 total crimes). Turning our attention to how prices differ, we look at the summary statistics for listing price in two higher-crime and two lower-crime zip codes, first in the form of a summary table, then in the form of a boxplot:

Summary Statistics   20002 (low)       20004 (high)      20009 (low)       20010 (high)
Price
   min               29                48                20                30
   max               1450              2900              3200              10000
   median            115               194               120               99
   mean (sd)         171.55 ± 190.82   260.19 ± 313.26   176.66 ± 232.62   181.26 ± 608.26

Interestingly, the summary table shows that the two higher-crime zip codes (20004 and 20010) have higher mean listing prices than the two lower-crime zip codes (20002 and 20009). The pattern is less clear in the medians: 20004 has the highest median (194), but 20010 has the lowest (99), and its mean appears to be pulled up by extreme values (a maximum of 10,000 and a standard deviation of 608.26). Taken at face value, the means suggest that, contrary to conventional wisdom, crime may indeed target more affluent neighborhoods, if we assume that higher listing prices are correlated with wealthier neighborhoods (which may be the subject of a future study). Of course, this table and the subsequent boxplot capture only a very small sample, too small to support any substantive claims about crime and listing price. Indeed, were we to make such claims, we would be falling into the classic trap of treating correlation as causation, which we have learned to be a fallacy. As such, we now turn our attention to the most important topic at hand: regressions.
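Before moving on, a quick check of how far each mean sits from its median helps gauge how much the comparison of means depends on a few expensive listings. The Python sketch below uses only the figures reported in the summary table (Python is used for illustration; the variable names are ours). A mean-to-median ratio well above 1, or a standard deviation several times the mean, signals a strongly right-skewed price distribution.

```python
# Listing price statistics copied from the summary table above.
stats = {
    "20002": {"crime": "low", "median": 115, "mean": 171.55, "sd": 190.82},
    "20004": {"crime": "high", "median": 194, "mean": 260.19, "sd": 313.26},
    "20009": {"crime": "low", "median": 120, "mean": 176.66, "sd": 232.62},
    "20010": {"crime": "high", "median": 99, "mean": 181.26, "sd": 608.26},
}

# Ratio of mean to median, and coefficient of variation (sd / mean).
skew = {z: s["mean"] / s["median"] for z, s in stats.items()}
cv = {z: s["sd"] / s["mean"] for z, s in stats.items()}

for z, s in stats.items():
    print(f"{z} ({s['crime']}): mean/median = {skew[z]:.2f}, "
          f"sd/mean = {cv[z]:.2f}")

most_skewed = max(skew, key=skew.get)
lowest_median = min(stats, key=lambda z: stats[z]["median"])
print("most skewed zip code:", most_skewed)
print("lowest median price:", lowest_median)
```

In these figures, zip code 20010 is both the most skewed and the one with the lowest median, which is why its mean alone is weak evidence of higher prices in higher-crime areas.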

We begin first by building a simple regression model to estimate the effect of crime on listing price. After assessing this preliminary model using various evaluation techniques, we aim to tune the model, either through linearization or adding additional relevant regressors to the model. Below are the regression results from the simple linear regression of price vs. total crime:


Calls:
Model 1: lm(formula = price ~ total_crimes, data = lm1_input)

==========================
  Constant    237.696***  
              (11.740)    
  Crimes       -0.024     
               (0.020)    
--------------------------
  R-squared     0.000     
  F             1.432     
  p             0.232     
  N          4690         
==========================
  Significance:   
                *** = p < 0.001;   
                ** = p < 0.01;   
                * = p < 0.05  

The OLS estimate of the effect of total crimes on listing price is -0.024, suggesting that each additional crime is associated with a 2.4-cent decrease in listing price. However, this coefficient is not statistically significantly different from 0. Additionally, the model yields an R-squared of 0.000, indicating that virtually none of the variation in price is captured by this specification. Moreover, the p-value for the overall model (0.232) is well above the alpha level of 0.05, so the model as a whole is not significant.
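The significance statements above can be checked by hand from the printed coefficient and standard error. The Python sketch below does this (Python is used for illustration; the regression itself was run in R). It assumes a standard normal approximation to the t distribution, which is reasonable with N = 4690, and it uses the rounded values from the table, so small differences from the reported F and p are expected.

```python
import math

# Reported Model 1 estimates for total_crimes (rounded as printed above).
coef = -0.024
std_err = 0.020

# t statistic: the estimate divided by its standard error.
t_stat = coef / std_err

# Two-sided p-value under a standard normal approximation (assumption:
# with N = 4690 the t distribution is practically normal).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# With a single regressor, the overall F statistic equals the squared t.
f_from_t = t_stat ** 2

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, F = {f_from_t:.2f}")
```

The resulting t of about -1.2 falls well inside the ±1.96 band for the 5% level, and the implied p-value (about 0.23) and F (about 1.44) are close to the reported 0.232 and 1.432.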

While the estimated effect of total crime on price is not significant and therefore cannot shed light on either hypothesis, the negative sign of the coefficient is at least consistent with the conventional theory. However, the low R-squared value shows that there is still much room to improve the model. Even though total crime shows no significant relationship with price, it is still interesting to determine which types of crime affect prices the most. Below is the regression output of price vs. type of crime:


Calls:
Model 1: lm(formula = price ~ total_crimes, data = lm1_input)
Model 2: lm(formula = price ~ theftOther_rate + theftAuto_rate + robbery_rate + 
    motorTheft_rate, data = lm1_input)
Model 3: lm(formula = price ~ theftOther_rate + theftAuto_rate + robbery_rate + 
    motorTheft_rate + number_of_reviews + as.factor(room_type), 
    data = lm1_input)

=======================================================================
                                  Model 1      Model 2      Model 3    
-----------------------------------------------------------------------
  Constant                       237.696***   253.305***   321.940***  
                                 (11.740)     (13.274)     (13.703)    
  Crimes                          -0.024                               
                                  (0.020)                              
  Theft (other)                                32.073***    18.831**   
                                               (6.375)      (6.281)    
  Theft (from auto)                             3.059       -0.642     
                                               (7.815)      (7.646)    
  Robbery                                    -223.136***   -74.026     
                                              (66.169)     (65.246)    
  Motor Theft                                -215.027*    -255.909*    
                                             (103.999)    (101.468)    
  Number of Reviews                                         -0.750***  
                                                            (0.081)    
  Private Room/Entire Home/Apt                            -148.542***  
                                                           (13.004)    
  Shared Room/Entire Home/Apt                             -214.810***  
                                                           (32.436)    
-----------------------------------------------------------------------
  R-squared                        0.000        0.010        0.059     
  F                                1.432       11.910       42.195     
  p                                0.232        0.000        0.000     
  N                             4690         4689         4689         
=======================================================================
  Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05  

Model 2, which replaces total crime with the rates of individual crime categories, yields a fascinating result. Examining only the statistically significant coefficients shows that robbery and motor theft are negatively associated with listing prices, while home theft (we treat “other theft” as equivalent to home theft) is positively associated with them. More specifically, a one percent increase in the robbery rate and the motor theft rate is associated with an estimated decrease in listing price of $223.14 and $215.03, respectively. On the other hand, a similar one percent increase in home theft is associated with an increase of $32.07 in listing prices.

The coefficients suggest that both hypotheses may have some merit. It is intuitive that neighborhoods with higher robbery and motor theft rates will have lower listing prices. This result corroborates the idea that fewer people will want to stay in neighborhoods where personal and property safety is at risk. However, it also makes sense that home theft may be associated with higher listing prices. After all, wealthier neighborhoods with more luxury goods at home may very well be bigger targets for home invasion. The last category, theft from automobiles, is not statistically significant, which is plausible if theft from cars is largely unrelated to listing prices.

To combine the previous models with crime, Model 3 adds the number of reviews and the room type to the regression equation. Unsurprisingly, private rooms and shared rooms have significantly lower listing prices than entire homes/apartments ($148.54 and $214.81 lower, respectively). Somewhat more surprising is that a higher number of reviews is associated with lower prices. This result may arise because bad experiences (i.e., ones that compel guests to write a review) far outnumber good ones, so that heavily reviewed listings tend to be poorly reviewed and lower priced. More interesting is that adding these two variables weakens the crime estimates: robbery is no longer significant, while home theft and motor theft remain significant, with the effect of home theft nearly halved (from $32.07 to $18.83) and the effect of motor theft growing by about $40 (from -$215.03 to -$255.91). This result suggests that listing type may be a larger driver of listing price than crime rates.
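The significance levels discussed for Models 2 and 3 can be reproduced from the printed coefficients and standard errors. The Python sketch below recomputes the stars for each crime coefficient and the change between the two models (Python is used for illustration; the models themselves were fit in R). As an assumption, it uses a standard normal approximation for the p-values, which is reasonable at N = 4689; the dictionary layout and function names are ours.

```python
import math


def two_sided_p(coef, std_err):
    """Two-sided p-value from a normal approximation to the t statistic."""
    return math.erfc(abs(coef / std_err) / math.sqrt(2))


def stars(p):
    """Significance stars using the thresholds in the table footer."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# (coefficient, standard error, stars as reported), copied from the table.
model2 = {
    "Theft (other)": (32.073, 6.375, "***"),
    "Theft (from auto)": (3.059, 7.815, ""),
    "Robbery": (-223.136, 66.169, "***"),
    "Motor Theft": (-215.027, 103.999, "*"),
}
model3 = {
    "Theft (other)": (18.831, 6.281, "**"),
    "Theft (from auto)": (-0.642, 7.646, ""),
    "Robbery": (-74.026, 65.246, ""),
    "Motor Theft": (-255.909, 101.468, "*"),
}

# Recompute the stars for every crime coefficient in both models.
computed = {
    name: {term: stars(two_sided_p(c, se)) for term, (c, se, _) in m.items()}
    for name, m in (("Model 2", model2), ("Model 3", model3))
}

for term in model2:
    change = model3[term][0] - model2[term][0]
    print(f"{term}: Model 2 '{computed['Model 2'][term]}', "
          f"Model 3 '{computed['Model 3'][term]}', change = {change:+.2f}")
```

The recomputed stars match those printed in the table for all eight coefficients, and the change column shows the shifts described above: home theft falls by roughly 41%, while the motor theft coefficient moves about $40 further below zero.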

4.4 Hedonic Regression Model

[PANCEA’S MODEL]

4.5 Machine Learning

[MATT’S MODEL]

5. Conclusion

[insert conclusion]